Abstract. We explore the use of Optimal Mixture Models to represent topics. We
analyze  two  broad  classes  of  mixture  models:  set-based  and  weighted.  We  provide 
an  original  proof  that  estimation  of  set-based  models  is  NP-hard,  and  therefore 
not  feasible.  We  argue  that  weighted  models  are  superior  to  set-based  models, 
and  the  solution  can  be  estimated  by  a  simple  gradient  descent  technique.  We 
demonstrate  that  Optimal  Mixture  Models  can  be  successfully  applied  to  the  task 
of  document  retrieval.  Our  experiments  show  that  weighted  mixtures  outperform 
a simple language modeling baseline. We also observe that weighted mixtures are
more robust than other approaches to estimating topical models.


1  Introduction 

Statistical  Language  Modeling  approaches  have  been  steadily  gaining  popularity  in  the 
field  of  Information  Retrieval.  They  were  first  introduced  by  Ponte  and  Croft  [18], 
and were expanded upon in a number of following publications [4, 15, 24, 8, 9, 11, 14].
These approaches have proven to be very effective in a number of applications, including
ad-hoc retrieval [18, 4, 15], topic detection and tracking [26, 10], summarization
[5],  question  answering  [3],  text  segmentation  [2],  and  other  tasks.  The  main  strength 
of  Language  Modeling  techniques  lies  in  very  careful  estimation  of  word  probabilities, 
something  that  has  been  done  in  a  heuristic  fashion  in  prior  research  on  Information 
Retrieval  [21,19,20,25]. 

A  common  theme  in  Language  Modeling  approaches  is  that  natural  language  is 
viewed  as  a  result  of  repeated  sampling  from  some  underlying  probability  distribution 
over  the  vocabulary.  If  one  accepts  that  model  of  text  generation,  many  Information 
Retrieval  problems  can  be  re-cast  in  terms  of  estimating  the  probability  of  observing  a 
given sample of text from a particular distribution. For example, if we knew a distribution
of words in a certain topic of interest, we could estimate the probability that a
given document is relevant to that topic, as was done in [26, 21, 14]. Alternatively, we
could associate a probability distribution with every document in a large collection, and
calculate the probability that a question or a query was a sample from that document
[18, 4, 15].

1.1  Mixture  Models 

Mixture  models  represent  a  very  popular  estimation  technique  in  the  field  of  Language 
Modeling. A mixture model is simply a linear combination of several different distributions.
Mixture models, in one shape or another, have been employed in every major
Language Modeling publication to date. For example, smoothing [6, 12, 17], a critical
component of any language model, can be interpreted as a mixture of a topic model
with a background model, as highlighted in [15, 13, 16].

This paper will be primarily concerned with the use of mixtures to represent semantic
topic models. For the scope of this paper, a topic model will be defined as a
distribution,  which  gives  the  probability  of  observing  any  given  word  in  documents  that 
discuss  some  particular  topic.  A  popular  way  to  estimate  the  topic  model  is  by  mixing 
word  probabilities  from  the  documents  that  are  believed  to  be  related  to  that  topic.  In 
the  next  section  we  will  briefly  survey  a  number  of  publications  exploring  the  use  of 
mixture  models  to  represent  topical  content. 

1.2  Related  Work  on  Mixture  Models 

Hoffman  [9]  described  the  use  of  latent  semantic  variables  to  represent  different  topical 
aspects  of  documents.  Hoffman  assumed  that  there  exist  a  fixed  number  of  latent  topical 
distributions  and  represented  documents  as  weighted  mixtures  of  those  distributions. 
Hoffman  used  an  expectation-maximization  algorithm  to  automatically  induce  topical 
distributions  by  maximizing  the  likelihood  of  the  entire  training  set.  It  is  worthwhile  to 
point  out  that  the  nature  of  the  estimation  algorithm  used  by  Hoffman  also  allows  one  to 
re-express  these  latent  aspect  distributions  as  mixtures  of  individual  document  models. 

Berger  and  Lafferty  [4]  introduced  an  approach  to  Information  Retrieval  that  was 
based  on  ideas  from  Statistical  Machine  Translation.  The  authors  estimated  a  semantic 
model  of  the  document  as  a  weighted  mixture  of  translation  vectors.  While  this  model 
does  not  involve  mixing  document  models,  it  is  still  an  example  of  a  mixture  model. 

In the context of Topic Detection and Tracking [1], several researchers used unweighted
mixtures of training documents to represent event-based topics. Specifically,
Jin et al. [10] trained a Markov model from positive examples, and Yamron et al. [26]
used  clustering  techniques  to  represent  background  topics  in  the  dataset  (a  topic  was 
represented  as  a  mixture  of  the  documents  in  the  cluster). 

Lavrenko [13] considered topical mixture models as a way to improve the effectiveness
of smoothing. Recall that smoothing is usually done by combining the sparse
topic model (obtained by counting words in some sample of text) with the background
model. Lavrenko hypothesized that by using a zone of closely related text samples he
could achieve semantic smoothing, where words that are closely related to the original
topic would get higher probabilities. Lavrenko used an unweighted mixture model,
similar  to  the  one  we  will  describe  in  section  3.1.  The  main  drawback  of  the  approach 
was  that  performance  was  extremely  sensitive  to  the  size  of  the  subset  he  called  the 
zone.  A  similar  problem  was  encountered  by  Ogilvie  [16]  when  he  attempted  to  smooth 
document  models  with  models  of  their  nearest  neighbors. 

In  two  very  recent  publications,  both  Lafferty  and  Zhai  [11],  and  Lavrenko  and  Croft 
[14]  proposed  using  a  weighted  mixture  of  top-ranked  documents  from  the  query  to 
represent  a  topic  model.  The  process  of  assigning  the  weights  to  the  documents  is  quite 
different in the two publications. Lafferty and Zhai describe an iterative procedure,
formalized as a Markov chain on the inverted indexes. Lavrenko and Croft estimate a joint
probability of observing the query words together with any possible word in the vocabulary.
Both approaches can be expressed as mixtures of document models, and in both
cases the authors pointed out that performance of their methods was strongly dependent
on the number of top-ranked documents over which they estimated the probabilities.


1.3  Overview 

The  remainder  of  this  paper  is  structured  as  follows.  In  section  2  we  formally  define 
the  problem  of  finding  an  Optimal  Mixture  Model  (OMM)  for  a  given  observation.  We 
also describe a lower bound on solutions to any OMM problem. Section 3 describes
unweighted optimal mixture models and proves that finding such models is computationally
infeasible. Section 4.1 defines weighted mixture models, and discusses a gradient
descent technique for approximating them. Section 5 describes a set of retrieval
experiments we carried out to test the empirical performance of Optimal Mixture Models.

2  Optimal  Mixture  Models 

As we pointed out in section 1.2, a number of researchers [13, 16, 11, 14] who employed
mixture models observed that the quality of the model is strongly dependent on the
subset of documents that are used to estimate the model. In most cases the researchers used
a fixed number of top-ranked documents, retrieved in response to the query. The
number of documents turns out to be an important parameter that has a strong effect on
performance and varies from query to query and from dataset to dataset. The desire to
select this parameter automatically is the primary motivation behind the present paper.
We would like to find the optimal subset of documents and form an Optimal Mixture
Model. Optimality can be defined in a number of different ways; for instance, it could
mean best retrieval performance with respect to some particular metric, like precision or
recall. However, optimizing to such metrics requires the knowledge of relevance
judgments, which are not always available at the time when we want to form our mixture
model. In this paper we take a very simple criterion for optimality. Suppose we have
a sample observation $W_1 \ldots W_k$, which could be a user's query, or an example
document. The optimal mixture model $M_{opt}$ is a model which assigns the highest probability
to our observation.

2.1  Formal  Problem  Statement 

Suppose $\mathcal{W} = \{1 \ldots n\}$ is our vocabulary, and $W_1 \ldots W_k$ is a string over
that vocabulary. Let $\mathbb{P}$ be the simplex of all probability distributions over
$\mathcal{W}$, that is $\mathbb{P} = \{x \in \mathbb{R}^n : x \geq 0, |x| = 1\}$. In most cases
we will not be interested in the whole simplex $\mathbb{P}$, but only in a small subset
$\mathbb{P}' \subset \mathbb{P}$, which corresponds to the set of all possible mixture models.
Exact construction of $\mathbb{P}'$ is different for different types of mixture models,
and will be detailed later. The optimal model $M_{opt}$ is the element of $\mathbb{P}'$ that
gives the maximum likelihood to the observation $W_1 \ldots W_k$:

    $M_{opt} = \arg\max_{M \in \mathbb{P}'} P(W_1 \ldots W_k | M)$    (1)

In Information Retrieval research, it is common to assume that words $W_1 \ldots W_k$
are mutually independent of each other, once we fix a model $M$. Equivalently, we can
say that each model $M$ is a unigram model, and $W_i$ represent repeated random samples
from $M$. This allows us to compute the joint probability $P(W_1 \ldots W_k | M)$ as the
product of the marginals:

    $M_{opt} = \arg\max_{M \in \mathbb{P}'} \prod_{i=1}^{k} P(W_i | M)$    (2)

Now we can make another assumption common in Information Retrieval: we declare
that $W_1 \ldots W_k$ are identically distributed according to $M$, that is for every word
$w$ we have $P(W_1 = w | M) = \ldots = P(W_k = w | M)$. Assuming $W_i$ are identically
distributed allows us to re-arrange the terms in the product above, and group together
all terms that share the same $w$:

    $M_{opt} = \arg\max_{M \in \mathbb{P}'} \prod_{w \in \mathcal{W}} P(w | M)^{\#w(W_1 \ldots W_k)}$    (3)

Here the product goes over all the words $w$ in our vocabulary, and $\#w(W_1 \ldots W_k)$
is just the number of times $w$ was observed in our sample $W_1 \ldots W_k$. If we let
$T_w = \#w(W_1 \ldots W_k) / k$, use a shorthand $M_w$ for $P(w | M)$, and take a logarithm
of the objective (which does not affect maximization), we can re-express $M_{opt}$ as follows:

    $M_{opt} = \arg\max_{M \in \mathbb{P}'} \sum_{w \in \mathcal{W}} T_w \log M_w$    (4)

Note that by definition, $T$ is also a distribution over the vocabulary, i.e. $T$ is a member
of $\mathbb{P}$, although it may not be a member of our subset $\mathbb{P}'$. We can think of
$T$ as an empirical distribution of the observation $W_1 \ldots W_k$. Now, since both $M$ and
$T$ are distributions, maximization of the above summation is equivalent to minimizing the
cross-entropy of distributions $T$ and $M$:

    $M_{opt} = \arg\min_{M \in \mathbb{P}'} H(T | M)$    (5)

Equation (5) will be used as our objective for forming optimal mixture models in all
the remaining sections of this paper. The main differences will be in the composition of
the subset $\mathbb{P}'$, but the objective will remain unchanged.
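As a small illustration of equations (3) to (5), the sketch below builds the empirical distribution $T$ of a toy observation and checks that the cross-entropy $H(T|M)$ equals the negative log-likelihood per word. The function names, the toy sample, and the model `M` are illustrative assumptions and do not come from the paper.

```python
import math
from collections import Counter

def empirical_distribution(words):
    """T_w = #w(W_1..W_k) / k for every word w that occurs in the sample."""
    k = len(words)
    return {w: c / k for w, c in Counter(words).items()}

def cross_entropy(T, M):
    """H(T|M) = -sum_w T_w log M_w; infinite if M gives no mass to an observed word."""
    h = 0.0
    for w, t in T.items():
        m = M.get(w, 0.0)
        if m <= 0.0:
            return math.inf
        h -= t * math.log(m)
    return h

def log_likelihood(words, M):
    """log P(W_1..W_k | M) under the unigram (independence) assumption."""
    return sum(math.log(M[w]) for w in words)

# Hypothetical toy data: a four-word observation and one candidate model.
sample = ["a", "b", "a", "c"]
M = {"a": 0.5, "b": 0.3, "c": 0.2}
T = empirical_distribution(sample)
```

Under these assumptions, `cross_entropy(T, M)` equals `-log_likelihood(sample, M) / len(sample)`, so ranking candidate models by likelihood or by cross-entropy gives the same order.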

2.2  Lower  Bound  on  OMM  solutions 

Suppose we allowed $\mathbb{P}'$ to include all possible distributions over our vocabulary, that
is we make $\mathbb{P}' = \mathbb{P}$. Then we can prove that $T$ itself is the unique optimal
solution of equation (5). The proof is detailed in section A.1 of the Appendix.

This observation serves as a very important step in analyzing the computational
complexity of finding the optimal model $M$ out of the set $\mathbb{P}'$. We proved that any
solution $M$ will be no better than $T$ itself. This implies that for every set $\mathbb{P}'$,
determining whether $T \in \mathbb{P}'$ is no more difficult than finding an optimal mixture
model from that same set $\mathbb{P}'$. The reduction is very simple: given $\mathbb{P}'$ and $T$,
let $M$ be the solution of equation (5). Then, according to section A.1, $T \in \mathbb{P}'$ if and
only if $M = T$. Testing whether $M = T$ can be done in linear time (with respect to the size
of our vocabulary), so we have a polynomial-time reduction from testing whether $T$ is a
member of $\mathbb{P}'$ to solving equation (5) and finding an optimal mixture model.

This result will be used in the remainder of this paper to prove that for certain sets
$\mathbb{P}'$, solving equation (5) is NP-hard. In all cases we will show that testing whether
$T \in \mathbb{P}'$ is NP-hard, and use the polynomial-time reduction from this section to assert
that solving equation (5) for that particular $\mathbb{P}'$ is NP-hard as well.
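The reduction described in this section can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: distributions are dense lists over a tiny vocabulary, $\mathbb{P}'$ is a short explicit list, and `solve_omm` is a hypothetical stand-in solver that searches that list exhaustively.

```python
import math

def cross_entropy(T, M):
    """H(T|M) over dense vectors; infinite if M misses a word that T uses."""
    h = 0.0
    for t, m in zip(T, M):
        if t > 0:
            if m <= 0:
                return math.inf
            h -= t * math.log(m)
    return h

def solve_omm(T, candidates):
    """Stand-in OMM solver: exhaustive search over a small finite P'."""
    return min(candidates, key=lambda M: cross_entropy(T, M))

def is_member(T, candidates, tol=1e-12):
    """Decide whether T lies in P' with one solver call and a linear-time comparison."""
    M = solve_omm(T, candidates)
    return all(abs(t - m) <= tol for t, m in zip(T, M))

# Hypothetical toy sets: one contains T, the other does not.
T = [0.5, 0.5]
with_T = [[0.9, 0.1], [0.5, 0.5]]
without_T = [[0.9, 0.1], [0.4, 0.6]]
```

Here `is_member(T, with_T)` is true and `is_member(T, without_T)` is false: the membership test costs one call to the solver plus a pass over the vocabulary, which is the shape of the reduction used in the NP-hardness arguments.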

3  Set-based  Mixture  Models 

The simplest and most intuitive type of mixture model is a set-based mixture. In this
section we describe two simple ways of constructing a mixture model if we are given a
set of documents. One is based on concatenating the documents in the set, the other
on averaging the document models. Very similar models were considered by Lavrenko
[13]  and  Ogilvie  [16]  in  their  attempts  to  create  unweighted  mixture  models.  Estimating 
either  of  these  models  from  a  given  set  of  documents  is  trivial.  However,  if  we  try  to 
look  for  the  optimal  set  of  documents,  the  problem  becomes  infeasible,  as  we  show  in 
section  3.3. 


3.1  Pooled  Optimal  Mixture  Models 

First we define a restricted class of mixture models that can be formed by “concatenating”
several pieces of text and taking the empirical distribution of the result. To make
this more formal, suppose we are given a large collection $C$ of text samples of varying
length. In this paper we will only consider finite sets $C$. For Information Retrieval
applications $C$ will be a collection of documents. For every text sample $\mathcal{T} \in C$ we can
construct its empirical distribution by setting $T_w = \#w(\mathcal{T}) / |\mathcal{T}|$, just as we
did in section 2.1. Here, $|\mathcal{T}|$ denotes the total number of words in $\mathcal{T}$. Similarly,
for every subset $S \subset C$, we can construct its empirical distribution by concatenating
together all elements $\mathcal{T} \in S$, and constructing the distribution of the resulting text.
In that case, the probability mass on the word $w$ would be:

    $M_{S,w} = \frac{\sum_{\mathcal{T} \in S} \#w(\mathcal{T})}{\sum_{\mathcal{T} \in S} |\mathcal{T}|}$    (6)


Now, for a given collection of samples $C$, we define the pooled mixture set $\mathbb{P}_{C,pool}$
to be the set of empirical distributions of all the subsets $S$ of $C$, where probabilities
are computed according to equation (6). We define the Pooled Optimal Mixture Model
(POMM) problem to be the task of solving equation (5) over the set $\mathbb{P}_{C,pool}$, i.e. finding
the element $M \in \mathbb{P}_{C,pool}$ which minimizes the cross-entropy $H(T|M)$ with a given
target distribution $T$.
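A minimal sketch of the pooled model of equation (6), assuming each text sample is represented as a list of word tokens. The function name and the toy documents are illustrative only.

```python
from collections import Counter

def pooled_model(subset):
    """Equation (6): concatenate the samples in `subset` and take relative frequencies."""
    counts = Counter()
    for sample in subset:
        counts.update(sample)          # add this sample's word counts to the pool
    total = sum(counts.values())       # total length of the concatenated text
    return {w: c / total for w, c in counts.items()}

# Hypothetical toy subset S: one three-word sample and one one-word sample.
docs = [["a", "a", "b"], ["b"]]
M = pooled_model(docs)
```

In this toy case the pooled text is "a a b b", so each word receives probability 2/4. Note that longer samples contribute more to the pooled model than shorter ones.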


3.2  Averaged  Optimal  Mixture  Models 

Next we consider another class of mixture models, similar to pooled models described
in the last section. These models are also based on a collection $C$ of text samples, and
can be formed by “averaging” word frequencies across several pieces of text. To make
this formal, let $C$ be a finite collection of text samples. Let $\mathcal{M}$ be the corresponding
collection of empirical distributions, that is for each observation $\mathcal{T}_j \in C$, there exists
a corresponding distribution $M_j \in \mathcal{M}$, such that $M_{j,w} = \#w(\mathcal{T}_j) / |\mathcal{T}_j|$.
For a subset $S' \subset C$, we can construct its distribution by averaging together the
empirical distributions of elements in $S'$. Let $S'$ be a set of text samples, let $S$ be the
set of corresponding empirical models, and let $\#(S)$ denote the number of elements in $S$.
The probability mass on the word $w$ is:

    $M_{S,w} = \frac{1}{\#(S)} \sum_{M_j \in S} M_{j,w}$    (7)


For a given collection of samples $C$, we define the averaged mixture model set
$\mathbb{P}_{C,avg}$ to be the set of averaged distributions of all subsets $S'$ of $C$, with
probabilities computed according to equation (7). We define the Averaged Optimal Mixture Model
(AOMM) problem to be the task of solving equation (5) over the set $\mathbb{P}_{C,avg}$.
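A minimal sketch of the averaged model of equation (7), again assuming each text sample is a list of word tokens. The function names and the toy documents are illustrative only.

```python
from collections import Counter

def empirical_model(sample):
    """M_j,w = #w(T_j) / |T_j| for one text sample."""
    n = len(sample)
    return {w: c / n for w, c in Counter(sample).items()}

def averaged_model(subset):
    """Equation (7): average the empirical models of the samples in `subset`."""
    models = [empirical_model(s) for s in subset]
    vocab = set().union(*models)       # every word seen in any sample
    return {w: sum(M.get(w, 0.0) for M in models) / len(models) for w in vocab}

# Hypothetical toy subset: one three-word sample and one one-word sample.
docs = [["a", "a", "b"], ["b"]]
M = averaged_model(docs)
```

For these two samples the averaged model gives "a" a probability of (2/3 + 0)/2 = 1/3 and "b" a probability of (1/3 + 1)/2 = 2/3. Unlike pooling, every sample carries the same weight regardless of its length.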


3.3  Finding  the  Optimal  Subset  is  Infeasible 

We  outlined  two  possible  ways  for  estimating  a  mixture  model  if  we  are  given  a  set  of 
documents. Now suppose we were given a target distribution $T$ and a collection $C$, and
wanted to find a subset $S \subset C$ which produces an optimal mixture model with respect to
$T$. It turns out that this problem is computationally infeasible. Intuitively, this problem
involves searching over an exponential number of possible subsets of $C$. In section A.3
of the Appendix we prove that finding an optimal subset for pooled models is NP-hard.
In section A.4 we show the same for averaged models. In both proofs we start by using
the result of section 2.2 and converting the optimization problem to a decision problem
over the same space of distributions. Then we describe a polynomial-time reduction
from 3SAT to the corresponding decision problem. 3SAT (described in A.2) is a
well-known NP-hard problem, and reducing it to finding an optimal subset of documents
proves our searching problem to be NP-hard as well.

It  is  interesting  to  point  out  that  we  were  not  able  to  demonstrate  that  finding  an 
optimal  subset  can  actually  be  solved  by  a  nondeterministic  machine  in  polynomial 
time.  It  is  easy  to  show  that  the  decision  problems  corresponding  to  POMM  and  AOMM 
are  in  the  NP  class,  but  the  original  optimization  problems  appear  to  be  more  difficult. 
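To make the size of the search space concrete, the sketch below solves a toy POMM instance by brute force, enumerating every non-empty subset of the collection. This is an illustration of the exhaustive approach, not a method proposed in the paper; the function names and toy data are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pooled_model(subset):
    """Equation (6): relative word frequencies of the concatenated samples."""
    counts = Counter()
    for sample in subset:
        counts.update(sample)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cross_entropy(T, M):
    """H(T|M); infinite if M gives no mass to a word with T_w > 0."""
    h = 0.0
    for w, t in T.items():
        m = M.get(w, 0.0)
        if m <= 0.0:
            return math.inf
        h -= t * math.log(m)
    return h

def best_pooled_subset(T, collection):
    """Try all 2^n - 1 non-empty subsets and keep the one with lowest H(T|M)."""
    best, best_h, tried = None, math.inf, 0
    for r in range(1, len(collection) + 1):
        for idx in combinations(range(len(collection)), r):
            tried += 1
            h = cross_entropy(T, pooled_model([collection[i] for i in idx]))
            if h < best_h:
                best, best_h = idx, h
    return best, best_h, tried

# Hypothetical toy instance with three samples and a two-word target.
collection = [["a", "a"], ["b", "b"], ["c"]]
T = {"a": 0.5, "b": 0.5}
best, best_h, tried = best_pooled_subset(T, collection)
```

With three samples the loop evaluates 7 subsets and selects the first two samples, whose pooled model equals $T$. The count grows as $2^n - 1$, so this approach is only usable for very small collections.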


4  Weighted  Mixture  Models 

Now  we  turn  our  attention  to  another,  more  complex  class  of  Optimal  Mixture  Models. 
For  set-based  models  of  section  3,  the  probabilities  were  completely  determined  by 
which  documents  belonged  to  the  set,  and  no  weighting  on  documents  was  allowed. 
Now  we  consider  the  kinds  of  models  where  in  addition  to  selecting  the  subset,  we  also 
allow  putting  different  weights  on  the  documents  in  that  subset.  This  flavor  of  mixture 
models  was  used  by  Hoffman  [9],  Lafferty  and  Zhai  [11],  and  Lavrenko  and  Croft  [14] 
in  their  research. 


4.1  Weighted  Optimal  Mixture  Models 


Now if we want to find an optimal mixture model for some observation we not only need
to find the subset of documents to use, but also need to estimate the optimal weights to
place on those documents. At first glance it appears that allowing weights on documents
will only aggravate the fact that finding optimal models is infeasible (section 3.3), since
we just added more degrees of freedom to the problem. In reality, allowing weights to
be placed on documents actually makes the problem solvable, as it paves the way for
numerical approximations. Recall that both POMM and AOMM are essentially combinatorial
problems: in both cases we attempt to reduce cross-entropy (equation (5)) over
a finite set of distributions: $\mathbb{P}_{C,pool}$ for POMM, and $\mathbb{P}_{C,avg}$ for AOMM. Both
sets are exponentially large with respect to $C$, but are finite and therefore full of
discontinuities. In order to use numerical techniques we must have a continuous space
$\mathbb{P}'$. In this section we describe how we can extend $\mathbb{P}_{C,pool}$ or equivalently
$\mathbb{P}_{C,avg}$ to a continuous simplex $\mathbb{P}_{C,\lambda}$. We define the Weighted Optimal
Mixture Model problem (WOMM) to be the optimization of equation (5) over the simplex
$\mathbb{P}_{C,\lambda}$. We argue that a WOMM solution will always be no worse than the solution of a
POMM or AOMM for a given $C$, although that solution may not necessarily lie in
$\mathbb{P}_{C,pool}$ or $\mathbb{P}_{C,avg}$. We look at a simple gradient descent technique for solving
WOMM. The technique is not guaranteed to find a globally optimal solution, but in practice
converges quite rapidly and exhibits good performance.


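WOMM Definition Let $C$ be our set of text samples and let $\mathcal{M}_C$ be the corresponding
set of empirical models $M_\mathcal{T}$ for each sample $\mathcal{T} \in C$. For an arbitrary set of
weights $\lambda \in \mathbb{R}^{\#(C)}$ we can define the corresponding model $M_\lambda$ to be the
average of all the models in $\mathcal{M}_C$, weighted by $\lambda$:

    $M_{\lambda,w} = \sum_{\mathcal{T} \in C} \lambda_\mathcal{T} M_{\mathcal{T},w}$    (8)

It is easy to verify that equation (8) defines a valid distribution, as long as $\lambda \geq 0$
and $|\lambda| = 1$. Now we can define $\mathbb{P}_{C,\lambda}$ to be the set of all possible linear
combinations of models in $\mathcal{M}_C$, i.e. $\mathbb{P}_{C,\lambda} = \{M_\lambda : \lambda \geq 0, |\lambda| = 1\}$.
WOMM is defined as solving equation (5) over $\mathbb{P}_{C,\lambda}$.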

Relationship to Set-based Models It is important to realize that there is a strong
connection between WOMM and set-based models from section 3. The simplex
$\mathbb{P}_{C,\lambda}$ includes both sets $\mathbb{P}_{C,pool}$ and $\mathbb{P}_{C,avg}$, since:

(i) equations (7) and (8) imply that an AOMM model of a set $S$ is the same thing as a
WOMM model $M_\lambda$ where $\lambda_\mathcal{T} = 1 / \#(S)$ when $\mathcal{T} \in S$, and
$\lambda_\mathcal{T} = 0$ for $\mathcal{T} \notin S$

(ii) equations (6) and (8) imply that a POMM model of a set $S$ is equivalent to a
WOMM model $M_\lambda$ where $\lambda_\mathcal{T} = |\mathcal{T}| / \sum_{\mathcal{T}' \in S} |\mathcal{T}'|$ when
$\mathcal{T} \in S$, and $\lambda_\mathcal{T} = 0$ for $\mathcal{T} \notin S$

This implies that every element of either $\mathbb{P}_{C,avg}$ or $\mathbb{P}_{C,pool}$ is also an element of
$\mathbb{P}_{C,\lambda}$. Therefore, a weighted optimal mixture model will be as good as, or better
than, any set-based mixture model, as long as we are dealing with the same collection $C$.
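The two weightings in (i) and (ii) can be checked numerically on a toy example. The sketch below assumes a two-word vocabulary and two samples given as word-count vectors; all names and numbers are illustrative.

```python
def weighted_model(models, lam):
    """Equation (8): M_lambda,w = sum over samples of lambda_T * M_T,w."""
    return [sum(l * M[w] for l, M in zip(lam, models))
            for w in range(len(models[0]))]

# Hypothetical word counts of two samples over a two-word vocabulary.
counts = [[2, 1], [0, 1]]
lengths = [sum(c) for c in counts]                       # |T| for each sample
models = [[x / n for x in c] for c, n in zip(counts, lengths)]

# Set-based models computed directly from their definitions.
pooled = [sum(col) / sum(lengths) for col in zip(*counts)]    # equation (6)
averaged = [sum(col) / len(models) for col in zip(*models)]   # equation (7)

# The weights from items (ii) and (i) respectively.
lam_pool = [n / sum(lengths) for n in lengths]
lam_avg = [1 / len(models)] * len(models)

from_pool_weights = weighted_model(models, lam_pool)
from_avg_weights = weighted_model(models, lam_avg)
```

In this example `from_pool_weights` matches `pooled` and `from_avg_weights` matches `averaged` up to floating-point rounding, which is the containment claimed above for a subset $S$ equal to the whole toy collection.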


4.2  Iterative  Gradient  Solution 


Since $\mathbb{P}_{C,\lambda}$ is a continuous simplex, we can employ numerical techniques to iteratively
approach a solution. We describe a gradient descent approach, similar to the one
advocated by Yamron et al. [26]. Recall that our objective is to minimize the cross-entropy
(equation (5)) of the target distribution $T$ over the simplex $\mathbb{P}_{C,\lambda}$. For a given collection
$C$, every element of $\mathbb{P}_{C,\lambda}$ can be expressed in terms of $\lambda$, the vector of mixing weights,
according to equation (8). We rewrite the objective function in terms of the mixing
vector $\lambda$:

    $H_\lambda(T | \{M_\mathcal{T}\}) = -\sum_w T_w \log \sum_\mathcal{T} \frac{\lambda_\mathcal{T}}{\sum_{\mathcal{T}'} \lambda_{\mathcal{T}'}} M_{\mathcal{T},w}$    (9)

    $= -\sum_w T_w \log \sum_\mathcal{T} \lambda_\mathcal{T} M_{\mathcal{T},w} + \log \sum_\mathcal{T} \lambda_\mathcal{T}$    (10)

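Note that in equation (9), we used the expression $\lambda_\mathcal{T} / \sum_{\mathcal{T}'} \lambda_{\mathcal{T}'}$
instead of $\lambda_\mathcal{T}$. Doing this allows us to enforce the constraint that the mixing
weights should sum to one without using Lagrange multipliers or other machinery of
constrained optimization. In other words, once we made this change to the objective
function, we can perform unconstrained minimization over $\lambda$. In order to find the
extremum of equation (10) we take the derivative with respect to the mixing weight of
each element $k$:

    $\frac{\partial H_\lambda}{\partial \lambda_k} = -\sum_w \frac{T_w M_{k,w}}{\sum_\mathcal{T} \lambda_\mathcal{T} M_{\mathcal{T},w}} + \frac{1}{\sum_\mathcal{T} \lambda_\mathcal{T}}$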

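After setting the derivative equal to zero, and re-arranging the terms, we see that the
extremum is achieved when for every $k \in C$ we have:

    $1 = \sum_w \frac{T_w M_{k,w}}{\sum_\mathcal{T} (\lambda_\mathcal{T} / \sum_{\mathcal{T}'} \lambda_{\mathcal{T}'}) M_{\mathcal{T},w}}$    (11)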

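We can take this equation and turn it into an incremental update rule. Suppose $\lambda_k^n$
is the mixing weight of element $k$ after $n$ iterations of the algorithm. Then at the next
iteration the weight should become:

    $\lambda_k^{n+1} \leftarrow \sum_w \frac{T_w M_{k,w} \lambda_k^n}{\sum_\mathcal{T} (\lambda_\mathcal{T}^n / \sum_{\mathcal{T}'} \lambda_{\mathcal{T}'}^n) M_{\mathcal{T},w}}$    (12)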

It is easy to see that when the extremum is achieved, equation (11) holds, and the
value $\lambda_k$ will not change from one iteration to another, so the extremum is a fixed
point of the procedure. In practice, it is sufficient to run the procedure for just a few
iterations, as it converges rapidly. Every iteration of update (12) requires on the order of
$\#(C) \times \#(\mathcal{W})$ operations, and the number of iterations can be held at a constant.
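A minimal sketch of update (12), under several assumptions: distributions are dense lists over a small vocabulary, every word with $T_w > 0$ has non-zero probability in at least one model that carries positive weight (otherwise the denominator would be zero), and the toy numbers are invented for illustration.

```python
import math

def cross_entropy(T, models, lam):
    """H(T | M_lambda) for the mixture with normalized weights lam / sum(lam)."""
    total = sum(lam)
    h = 0.0
    for w, t in enumerate(T):
        if t > 0:
            mix = sum((l / total) * M[w] for l, M in zip(lam, models))
            h -= t * math.log(mix)
    return h

def womm_update(T, models, lam):
    """One pass of update (12) over all mixing weights."""
    total = sum(lam)
    # Mixture probability of every word under the current normalized weights.
    mix = [sum((l / total) * M[w] for l, M in zip(lam, models))
           for w in range(len(T))]
    # Each weight is rescaled by sum_w T_w * M_k,w / mix_w.
    return [l * sum(T[w] * M[w] / mix[w] for w in range(len(T)) if T[w] > 0)
            for l, M in zip(lam, models)]

# Hypothetical target and three document models over a three-word vocabulary.
T = [0.5, 0.3, 0.2]
models = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
lam = [1 / 3, 1 / 3, 1 / 3]

history = [cross_entropy(T, models, lam)]
for _ in range(5):
    lam = womm_update(T, models, lam)
    history.append(cross_entropy(T, models, lam))
```

Each pass touches every (document, word) pair once, matching the stated cost of $\#(C) \times \#(\mathcal{W})$. In this toy run the weights continue to sum to one and the values in `history` do not increase from one iteration to the next.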


Local Minima It is important to realize that the iterative update in equation (12) is not
guaranteed to converge to the global minimum of equation (10). The reason for that is
that the objective function may not be convex everywhere. We can see that clearly when we
take the second derivative of the objective with respect to the mixture weights:

    $\frac{\partial^2 H_\lambda}{\partial \lambda_k \partial \lambda_i} = \sum_w \frac{T_w M_{k,w} M_{i,w}}{(\sum_j \lambda_j M_{j,w})^2} - \frac{1}{(\sum_j \lambda_j)^2}$    (13)

It is not obvious whether the right-hand side of the equation above is positive or negative,
so we cannot conclude whether the function is globally convex, or whether it has local
minima. In practice we found that the incremental algorithm converges quite rapidly.


5  Experimental  Results 

In  this  section  we  discuss  an  application  of  Optimal  Mixture  Models  to  the  problem  of 
estimating  a  topic  model  from  a  small  sample.  The  experiments  were  carried  out  in  the 
following setting. Our collection $C$ is a collection of approximately 60,000 newswire
and broadcast news stories from the TDT2 corpus [7]. For this dataset, we have a
collection of 96 event-centered topics. Every topic is defined by the set $R \subset C$ of stories
that are relevant to it. The relevance assessments were carried out by LDC [7] and are
exhaustive.

For every topic, our goal is to estimate $M_R$, the distribution of words in the
documents relevant to that topic. We assume that the relevant set $R$ is unknown, but we
have a single example document $\mathcal{T} \in R$. The goal is to approximate the topic model
$M_R$ as closely as possible using only $\mathcal{T}$ and $C$. We formulate the problem as finding the
optimal weighted mixture model, and use the iterative update detailed in section 4.2 to
estimate the optimal mixture. Since we do not know $R$, we cannot optimize equation
(5) for $M_R$ directly. We hypothesize that optimizing for $T = M_\mathcal{T}$ is a good alternative.
This hypothesis is similar to the assumptions made in [14]. Note that this assumption
may be problematic, since effectively $\mathcal{T}$ is an element of $C$. This means that our
gradient solution will eventually converge to $M_\mathcal{T}$, which is not what we want, since we want
to converge to $M_R$. However, we hope that running the gradient method for just a few
iterations will result in a reasonable mixture model.

We  carry  out  three  types  of  experiments.  First  we  demonstrate  that  the  gradient 
procedure  described  in  section  4.2  indeed  converges  to  the  target.  Second  we  look  at 
how  well  the  resulting  mixture  model  approximates  the  real  topic  model  Mr.  Finally, 
we  perform  a  set  of  ad-hoc  retrieval  experiments  to  demonstrate  that  our  mixture  model 
can  be  used  to  produce  effective  document  rankings. 

5.1  Convergence  to  Target 

Figure 1 shows how quickly the weighted mixture model converges to the target
distribution $M_\mathcal{T}$. On the y-axis we plotted the relative entropy (equation (14) in the
Appendix) between the mixture model and the target model, as a function of the number of
gradient updates. Relative entropy is averaged over all 96 topics. The solid line shows
the mean value, while the bars indicate one standard deviation around the mean. We
observe that relative entropy rapidly converges to zero, which is understandable, since
$\mathcal{T} \in C$. Eventually, the iterative procedure will force the mixing vector $\lambda$ to have zeros
in all places except $\lambda_\mathcal{T}$, which would be driven to one.


Fig.  1.  Convergence  of  the  Weighted  Optimal  Mixture  Model  to  the  target  distribution.  Target  is  a 
single  document  discussing  the  topic  of  interest.  Mixture  Model  estimated  using  iterative  gradient 
updates. 


5.2  Convergence  to  True  Topic  Model 

In the next set of experiments, we measure relative entropy of our mixture model with
respect to $M_R$, which is the true topic model. $M_R$ is estimated as an averaged mixture
model from $R$, the set of documents known to be relevant to the topic. Note that since
the solutions were optimized towards $M_\mathcal{T}$ and not towards $M_R$, we cannot expect
relative entropy to continue decreasing with more and more iterations. In fact, once the
solution is very close to $M_\mathcal{T}$, we expect it to be “far” from $M_R$. What we hope for is
that after a few (but not too many) iterations, the solution will be sufficiently close to
$M_R$. Figure 2 shows the performance with respect to $M_R$ for 0, 1, 2 and 5 iterations.
We show relative entropy as a function of how many documents we include in our
collection $C$. The documents are ranked using $\mathcal{T}$ as query, and then some number $n$ of
top-ranked documents is used. This is done to highlight the fact that even with gradient
optimization, the model shows strong dependency on how many top-ranked documents
are used to estimate the probabilities. For comparison, we show the lower bound, which
is what we would obtain if we used $n$ documents that are known to be in $R$.




Fig.  2.  Convergence  of  the  mixture  model  to  the  true  topic  model:  running  the  gradient  update  for 
two  iterations  is  promising. 


We observe that running the gradient algorithm for two iterations moves the solution
closer to $M_R$, but doing more iterations actually hurts the performance. Running it
for  five  iterations  results  in  higher  relative  entropy  than  not  running  it  at  all.  We  also 
note  that  with  more  iterations,  performance  becomes  less  sensitive  to  the  number  of 
documents  in  C.  Overall,  we  can  conclude  that  running  the  gradient  algorithm  for  very 
few  iterations  is  promising. 

5.3  Retrieval  Experiments 

Finally, we consider an application of Optimal Mixture Models to the problem of
document retrieval. We start with a single relevant example and estimate a mixture model
as was described above. Then, for every document in our dataset we compute the relative
entropy (equation (14) in the Appendix) between the model of that document and the
estimated mixture model. The documents are then ranked in increasing order of relative
entropy. This type of ranking was first proposed by Lafferty and Zhai [11] and
was found to be quite successful in conjunction with Query Models [11] and Relevance
Models [14].
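
The ranking step above can be sketched in a few lines of Python. The document models,
the topic model and the function names below are hypothetical, and the direction of the
divergence (topic model against document model) is an assumption; the models are taken
to be already smoothed so that no probability is zero.

```python
import math

def relative_entropy(p, q):
    """sum_w p_w log(p_w / q_w) for two distributions over the same vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def rank_by_relative_entropy(topic_model, doc_models):
    """Return document ids in increasing order of relative entropy."""
    scores = {doc_id: relative_entropy(topic_model, model)
              for doc_id, model in doc_models.items()}
    return sorted(scores, key=scores.get)

# Hypothetical smoothed models over a three-word vocabulary.
topic = {"storm": 0.5, "flood": 0.4, "vote": 0.1}
docs = {
    "d1": {"storm": 0.45, "flood": 0.45, "vote": 0.10},
    "d2": {"storm": 0.10, "flood": 0.10, "vote": 0.80},
    "d3": {"storm": 0.30, "flood": 0.30, "vote": 0.40},
}
ranking = rank_by_relative_entropy(topic, docs)
print(ranking)
```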

Figure 3 shows the retrieval performance when a mixture model is estimated over a
set of 10 top-ranked documents using two iterations of the gradient update. To provide
a baseline, we replace the estimated mixture model by the model of the single training
example, and perform retrieval using the same ranking function. For comparison we
also show the performance of an unweighted mixture model formed from the same set of
10 top-ranked documents.


Fig. 3. Document retrieval: mixture model (top) outperforms a baseline of single training example
(middle). Mixture model used 10 documents for two iterations. (Plot not reproduced; axis label:
Precision.)


Fig. 4. Optimal Mixture Models exhibit less dependency on the number of top-ranked documents
used in the estimation. Two or more iterations with any number of documents outperforms the
baseline. (Plot not reproduced; axis label: Number of documents in the set.)


We observe that our weighted mixture noticeably outperforms the baseline at all
levels of recall. The improvement is statistically significant at all levels of recall, except
at zero. Significance was determined by performing the sign test at the 0.05 level.
The magnitude of the improvement is not very large, partly due to the fact that the
baseline performance is already very high. The baseline has a non-interpolated average
precision of 0.7250, which is very high by TREC standards. Such high performance is
common in TDT, where topic definitions are much more precise than in TREC. The weighted
mixture model yields an average precision of 0.7583, an improvement of about 5%. Note
that the unweighted mixture model constructed from the same set of documents as the
weighted model performs significantly worse, resulting in a 9% drop from the baseline.
This means that document weighting is a very important aspect of estimating a good topic
model.
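
For readers unfamiliar with the sign test, the sketch below shows one common two-sided
form of it, together with the relative-improvement arithmetic for the two average
precision figures quoted above. The per-topic scores are invented purely for
illustration, and the function names are hypothetical; the exact variant of the test
used in the experiments is not specified here.

```python
from math import comb

def sign_test_p_value(baseline, system):
    """Two-sided sign test on paired scores; tied pairs are discarded."""
    wins = sum(s > b for b, s in zip(baseline, system))
    losses = sum(s < b for b, s in zip(baseline, system))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def relative_change(baseline, value):
    """Relative change of value with respect to baseline."""
    return (value - baseline) / baseline

# Invented per-topic precision values, for illustration only.
baseline_scores = [0.70, 0.65, 0.80, 0.72, 0.60, 0.75, 0.68, 0.77, 0.71, 0.66]
system_scores = [0.74, 0.69, 0.82, 0.75, 0.66, 0.74, 0.70, 0.80, 0.76, 0.70]

p_value = sign_test_p_value(baseline_scores, system_scores)
print(p_value < 0.05)
print(relative_change(0.7250, 0.7583))
```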

Finally, we re-visit the issue that motivated this work, the issue of sensitivity to
the number of top-ranked documents used in the estimation. We repeated the retrieval
experiment shown in Figure 3, but with varying numbers of top-ranked documents.
The results are summarized in Figure 4. We observe that the Weighted Optimal Mixture
Model with two or more iterations of training is fairly insensitive to the number of
top-ranked documents that are used in the estimation. The performance is always above the
single-document baseline. In contrast, we see that unweighted models (uniform
weights) perform significantly worse than the baseline, and furthermore their
performance varies widely with the number of top-ranked documents used. We believe these
results to be extremely encouraging, since they show that weighted mixture models are
considerably more stable than unweighted models.


6  Conclusions  and  Future  Work 


In  this  paper  we  explored  using  Optimal  Mixture  Models  to  estimate  topical  models. 
We  defined  optimality  in  terms  of  assigning  maximum  likelihood  to  a  given  sample  of 
text,  and  looked  at  two  types  of  mixture  models:  set-based  and  weighted.  We  presented 
an  original  proof  that  it  is  not  feasible  to  compute  set-based  optimal  mixture  models. 
We  then  showed  that  weighted  mixture  models  are  superior  to  set-based  models,  and 
suggested  a  gradient  descent  procedure  for  estimating  weighted  mixture  models.  Our 
experiments  show  weighted  mixtures  outperforming  the  baseline  on  a  simple  retrieval 
task,  and,  perhaps  more  importantly,  demonstrate  that  weighted  mixtures  are  relatively 
insensitive  to  the  number  of  top-ranked  documents  used  in  the  estimation. 

In  the  course  of  this  work  we  encountered  a  number  of  new  questions  that  warrant 
further  exploration.  We  would  like  to  analyze  in  detail  the  relationship  between  Optimal 
Mixture  Models  proposed  in  this  paper  and  other  methods  of  estimating  a  topic  model 
from  a  small  sample.  The  two  obvious  candidates  for  this  comparison  are  query  models 
[11]  and  relevance  models  [14].  Both  are  very  effective  at  estimating  accurate  topic 
models  starting  with  a  very  short  (2-3  word)  sample.  Optimal  Mixture  Models  appear 
to  be  more  effective  with  a  slightly  longer  sample  (200-300  words).  Another  question 
we  would  like  to  explore  is  the  use  of  tempered  or  annealed  gradient  descent  to  prevent 
over-fitting  to  the  target  sample.  Finally,  we  would  like  to  explore  in  detail  the  impact 
of  the  length  of  the  training  sample  on  the  quality  of  the  resulting  model. 


A  Appendix

A.1  Lower  bound  on  cross-entropy

In this section we prove that $T$ itself is the unique optimal solution of equation (5). The
proof is rudimentary and can be found in many Information Theory textbooks, but is
included in this paper for completeness. Assuming $T$ is in $\mathbb{P}$, we have to show that:

(i) for every $M \in \mathbb{P}$ we have $H(T|T) \le H(T|M)$

(ii) if $H(T|T) = H(T|M)$ then $M = T$.

To prove the first assertion we need to show that the difference in cross-entropies
$H(T|M) - H(T|T)$ is non-negative. This difference is sometimes referred to as relative
entropy or Kullback-Leibler divergence. By definition of cross-entropy:

$$H(T|M) - H(T|T) = -\sum_w T_w \log \frac{M_w}{T_w} \tag{14}$$

By Jensen's inequality, for any concave function $f(x)$ we have $E[f(x)] \le f(E[x])$.
Since logarithm is a concave function, we obtain:

$$-\sum_w T_w \log \frac{M_w}{T_w} \ge -\log \sum_w T_w \frac{M_w}{T_w} \tag{15}$$

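Now we note that the right-hand side of the above inequality is zero, because
$\sum_w T_w \frac{M_w}{T_w} = \sum_w M_w = 1$, which proves (i). In order to prove (ii) we
recall that Jensen's inequality becomes an equality if and only if $f(x)$ is linear in $x$
over the values that $x$ takes. Since $\log$ is strictly concave, this is only possible when
the ratio $M_w / T_w$ is constant, that is when $M_w = k T_w$ for some constant $k$. But
since $M$ and $T$ are both in $\mathbb{P}$, we must have:

$$1 = \sum_w T_w = \sum_w M_w = \sum_w k T_w = k$$

Hence $k = 1$, and therefore $M = T$, which proves (ii).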
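
The lower bound can also be checked numerically. The sketch below draws random
distributions and confirms that the cross-entropy $H(T|M)$ never falls below $H(T|T)$;
the function names, the vocabulary size and the random seed are arbitrary choices made
for this illustration, and the check is a sanity test rather than a substitute for the
proof.

```python
import math
import random

def cross_entropy(t, m):
    """H(T|M) = -sum_w T_w log M_w for two distributions given as lists."""
    return -sum(tw * math.log(mw) for tw, mw in zip(t, m) if tw > 0)

def random_distribution(size, rng):
    """A random point in the probability simplex with strictly positive entries."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
t = random_distribution(5, rng)
# Difference H(T|M) - H(T|T) for 1000 randomly drawn models M.
gaps = [cross_entropy(t, random_distribution(5, rng)) - cross_entropy(t, t)
        for _ in range(1000)]
print(min(gaps) >= 0)
```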


A.2  3SAT  Definition 


3SAT is a problem of determining satisfiability of a logical expression. An instance of
3SAT is a logical formula $F$ in conjunctive normal form, with clauses limited to
three primitives. The formula is a conjunction (an AND) of $m$ clauses, each of which
is a disjunction (an OR) of three variables from the set $\{x_1 \ldots x_n\}$ or their
negations $\{\neg x_1 \ldots \neg x_n\}$. An example of such a formula may be:
$F = (\neg x_1 \vee x_2 \vee x_3) \wedge (\neg x_1 \vee \neg x_2 \vee \neg x_4)$. Every
formula of propositional calculus can be converted into a 3SAT formula that is satisfiable
exactly when the original formula is. The task is to determine whether $F$ is satisfiable.
The formula is satisfiable if there exists an assignment of true/false values to the
variables $x_1 \ldots x_n$ which makes the whole formula true. In the given example, setting
$x_1$ = False and all other variables to True satisfies the formula.
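
The example can be checked mechanically. In the sketch below a literal is encoded as a
signed integer (`+i` for $x_i$, `-i` for its negation); this encoding and the function
names are choices made for the illustration. The brute-force search is exponential in
the number of variables and is only meant for tiny formulas.

```python
from itertools import product

# (not x1 or x2 or x3) and (not x1 or not x2 or not x4)
formula = [(-1, 2, 3), (-1, -2, -4)]

def satisfies(formula, assignment):
    """True if every clause contains at least one literal made true by assignment."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in formula)

def brute_force_sat(formula, n):
    """Try all 2^n assignments; return a satisfying one or None."""
    for values in product([False, True], repeat=n):
        assignment = dict(enumerate(values, start=1))
        if satisfies(formula, assignment):
            return assignment
    return None

example = {1: False, 2: True, 3: True, 4: True}
print(satisfies(formula, example))
```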


A.3  POMM  is  NP-hard 


In this section we will show that solving the Pooled Optimal Mixture Model (POMM) problem
is NP-hard. In order to do that, we define a corresponding decision problem EXACT-POMM,
prove EXACT-POMM to be NP-hard, and use the results in section 2.2 to assert that
POMM itself is NP-hard.

In a nutshell, EXACT-POMM is a problem of testing whether a target distribution
$T$ is a member of the pooled mixture model set $\mathbb{P}_{C,pool}$ induced by some
collection $C$ of text samples. Let $\mathcal{M}$ denote the collection of integer-valued
count vectors corresponding to each text sample in $C$. Each element of $\mathcal{M}$ is a
vector $M_j$ that specifies how many times every word $w$ occurs in $T_j$, the
corresponding text sample in $C$, i.e. $M_{j,w} = \#_w(T_j)$. The task is to determine
whether there exists a subset $S \subset \mathcal{M}$, such that the vectors in $S$, when
added up, form the same distribution as $T$:


$$T_w = \frac{\sum_{M_j \in S} M_{j,w}}{\sum_w \sum_{M_j \in S} M_{j,w}} \quad \text{for all } w \in W \tag{16}$$
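
To make the decision problem concrete, the following sketch tests equation (16) by
exhaustive search over subsets. This is only an illustration of the problem statement:
the function names and the toy count vectors are hypothetical, and the search takes
exponential time, which is consistent with the hardness result proved below.

```python
from fractions import Fraction
from itertools import combinations

def pooled_distribution(vectors):
    """Add up count vectors and normalize, as on the right-hand side of (16)."""
    totals = [sum(column) for column in zip(*vectors)]
    grand_total = sum(totals)
    return [Fraction(t, grand_total) for t in totals]

def exact_pomm(count_vectors, target):
    """Return indices of a subset whose pooled distribution equals target, or None."""
    for size in range(1, len(count_vectors) + 1):
        for subset in combinations(range(len(count_vectors)), size):
            chosen = [count_vectors[i] for i in subset]
            if sum(map(sum, chosen)) == 0:
                continue  # an all-zero pool cannot be normalized
            if pooled_distribution(chosen) == target:
                return subset
    return None

# Hypothetical count vectors over a three-word vocabulary.
vectors = [[2, 0, 1], [0, 1, 1], [1, 1, 0]]
target = [Fraction(3, 5), Fraction(1, 5), Fraction(1, 5)]
print(exact_pomm(vectors, target))
```

Exact rational arithmetic is used so that the equality test in (16) is not affected by
floating-point rounding.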


We will prove that EXACT-POMM is NP-hard by reduction from 3SAT, a well-known
problem of the NP-complete class [22]. We describe a polynomial time reduction
that converts an instance of 3SAT into an instance of EXACT-POMM, such that the 3SAT
instance is satisfiable if and only if the EXACT-POMM instance has a positive answer.
The reduction we describe is very similar to the one commonly used to prove that the
SUBSET-SUM problem is NP-hard. We are given a formula in conjunctive normal
form: $F = (p_{1,1} \vee p_{1,2} \vee p_{1,3}) \wedge \ldots \wedge (p_{m,1} \vee p_{m,2} \vee p_{m,3})$,
where each proposition $p_{j,k}$ is either some variable $x_i$ or its negation $\neg x_i$.
We want to construct a collection of vectors $\mathcal{M}$ and a target distribution $T$
such that $F$ is satisfiable if and only if there exists a subset $S \subset \mathcal{M}$ of
vectors which satisfies equation (16). Let $\mathcal{M}$ be the set of rows of the matrix in
Figure 5, let $T'$ be the target row at the bottom, and set the target distribution
$T_w = T'_w / \sum_w T'_w$ for all $w$.

In Figure 5, $I_{n \times n}$ denotes an $n$ by $n$ identity matrix. $POS_{n \times m}$ is an
$n$ by $m$ positive variable-clause adjacency matrix, that is $POS_{i,j} = 1$ when clause $j$
contains $x_i$, and $POS_{i,j} = 0$ otherwise. Similarly, $NEG_{n \times m}$ is an $n$ by $m$
negative variable-clause adjacency matrix: $NEG_{i,j}$ is 1 if clause $j$ contains
$\neg x_i$ and 0 if it does not. $0_{2m \times n}$ is just a $2m$ by $n$ all-zero matrix.


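$$
\begin{array}{c|c}
I_{n \times n} & POS_{n \times m} \\
I_{n \times n} & NEG_{n \times m} \\
0_{2m \times n} & \begin{array}{c} I_{m \times m} \\ I_{m \times m} \end{array} \\
\hline
1 \; \cdots \; 1 & 3 \; \cdots \; 3
\end{array}
$$

Fig. 5. Matrix used in the proof that 3SAT is reducible to EXACT-POMM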


Note that the matrix is identical to the one that is used to prove that the SUBSET-SUM
problem [23] is NP-hard by reduction from 3SAT. In line with the SUBSET-SUM
argument, it is easy to see that the formula $F$ is satisfiable if and only if there exists
a subset $S \subset \mathcal{M}$ of the rows such that $T'_w = \sum_{M_j \in S} M_{j,w}$ for
all $w$. The first $n$ components of the target ($T'_1 \ldots T'_n$) ensure that for every
variable $i$ either $x_i$ is true or $\neg x_i$ is true, but not both. The last $m$
components ($T'_{n+1} \ldots T'_{n+m}$) ensure that every clause has at least one
proposition set to true.
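
The construction can be tried out on the example formula from Section A.2. The sketch
below builds the rows and the target row following the block layout described for
Figure 5, then turns a satisfying assignment into a subset of rows whose column sums
equal the target. The signed-integer encoding of literals, the row ordering and the
function names are assumptions of this illustration, and each clause is assumed to
mention three distinct variables.

```python
def build_pomm_instance(formula, n):
    """Rows [I | POS], [I | NEG], two helper blocks [0 | I], and the target row."""
    m = len(formula)
    rows = []
    for sign in (+1, -1):
        for i in range(1, n + 1):
            left = [int(c == i) for c in range(1, n + 1)]
            right = [int(sign * i in clause) for clause in formula]
            rows.append(left + right)
    for _ in range(2):
        for j in range(m):
            rows.append([0] * n + [int(c == j) for c in range(m)])
    target = [1] * n + [3] * m
    return rows, target

def subset_for_assignment(formula, n, assignment):
    """Pick one row per variable, then helper rows to top each clause column up to 3."""
    m = len(formula)
    chosen = [(i - 1) if assignment[i] else (n + i - 1) for i in range(1, n + 1)]
    for j, clause in enumerate(formula):
        true_literals = sum(assignment[abs(lit)] == (lit > 0) for lit in clause)
        chosen += [2 * n + k * m + j for k in range(3 - true_literals)]
    return chosen

formula = [(-1, 2, 3), (-1, -2, -4)]
n = 4
rows, target = build_pomm_instance(formula, n)
assignment = {1: False, 2: True, 3: True, 4: True}
chosen = subset_for_assignment(formula, n, assignment)
sums = [sum(rows[r][c] for r in chosen) for c in range(len(target))]
print(sums == target)
```

Only two helper rows exist per clause, so the selection above is valid only when every
clause has at least one true literal, which is exactly the satisfiability condition.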

To demonstrate that satisfiability of $F$ reduces to EXACT-POMM, we just need to
show that:

$$T'_w = \sum_{M_j \in S} M_{j,w} \iff T_w = \frac{\sum_{M_j \in S} M_{j,w}}{\sum_w \sum_{M_j \in S} M_{j,w}} \tag{17}$$


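The forward implication is trivial, since $T'_w = \sum_{M_j \in S} M_{j,w}$ clearly implies
that $\sum_w T'_w = \sum_w \sum_{M_j \in S} M_{j,w}$, and therefore

$$T_w = \frac{T'_w}{\sum_w T'_w} = \frac{\sum_{M_j \in S} M_{j,w}}{\sum_w \sum_{M_j \in S} M_{j,w}} \tag{18}$$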


The converse is not obvious, since the equality of two ratios in equation (18) does
not by itself guarantee that their numerators are equal; we have to prove that their
denominators are equal as well. The proof is as follows. Assume equation (18) holds. Let
$k$ be a (constant) ratio of the denominators:

$$k = \frac{\sum_w \sum_{M_j \in S} M_{j,w}}{\sum_w T'_w} \tag{19}$$

Then we can re-write equation (18) as follows:

$$k \, T'_w = \sum_{M_j \in S} M_{j,w}, \quad \text{for all } w \tag{20}$$

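Now, for the left side of the matrix in Figure 5 ($w = 1 \ldots n$), we have $T'_w = 1$,
and $\sum_{M_j \in S} M_{j,w}$ can be 0, 1 or 2. Therefore it must be that $k$ is either 1
or 2 ($k = 0$ is ruled out because the denominators in equation (18) must be non-zero),
otherwise we cannot satisfy equation (20). Similarly, for the right side of the matrix
($w = n+1 \ldots n+m$) we have $T'_w = 3$, and $\sum_{M_j \in S} M_{j,w}$ can be 0, 1, 2, 3,
4 or 5. Therefore, to satisfy equation (20), $k$ must take a value in the set
$\{\frac{1}{3}, \frac{2}{3}, \frac{3}{3}, \frac{4}{3}, \frac{5}{3}\}$. Since equation (20)
must be satisfied for all $w$, $k$ must take a value in the intersection
$\{1, 2\} \cap \{\frac{1}{3}, \frac{2}{3}, \frac{3}{3}, \frac{4}{3}, \frac{5}{3}\}$.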


Therefore  k  =  1,  which  means  that  the  denominators  in  equation  (18)  are  equal,  and 
the  converse  implication  holds. 
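
The intersection argument is small enough to verify with exact rational arithmetic. In
the sketch below the variable names are arbitrary, and the candidate values of $k$ are
obtained by dividing each admissible non-zero column sum by the corresponding entry of
the target row (1 on the left side of the matrix, 3 on the right side).

```python
from fractions import Fraction

# Left side of the matrix: target entry 1, non-zero column sums 1 or 2.
left_candidates = {Fraction(column_sum, 1) for column_sum in (1, 2)}
# Right side of the matrix: target entry 3, non-zero column sums 1 through 5.
right_candidates = {Fraction(column_sum, 3) for column_sum in range(1, 6)}

common = left_candidates & right_candidates
print(common)
```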

The matrix in Figure 5 is of size $(2n + 2m)$ by $(n + m)$, therefore we have a
polynomial-time reduction from an instance of 3SAT to an instance of EXACT-POMM,
such that a 3SAT formula is satisfiable if and only if the corresponding instance of
EXACT-POMM has a positive answer. Accordingly, if a polynomial-time algorithm exists for
solving EXACT-POMM problems, this algorithm could be used to solve 3SAT in
polynomial time. Thus EXACT-POMM is NP-hard.

We have just demonstrated that EXACT-POMM is NP-hard. In section 2.2 we
demonstrated a reduction from decision problems (like EXACT-POMM) to
optimization problems (like POMM). Therefore POMM is NP-hard.


A.4  AOMM  is  NP-hard 


In this section we will prove that solving the AOMM problem is NP-hard. The proof
follows the same outline as the proof in section A.3. We define a corresponding decision
problem EXACT-AOMM, prove it to be NP-hard by reduction from 3SAT, then use the
result of section 2.2 to assert that AOMM is NP-hard.

AOMM is defined as an optimization problem over $\mathbb{P}_{C,avg}$ with respect to a given
target $T$. Correspondingly, EXACT-AOMM is a problem of determining whether $T$
itself is an element of $\mathbb{P}_{C,avg}$. An instance of EXACT-AOMM is a target
distribution $T \in \mathbb{P}$, and a finite collection of empirical distributions
$\mathcal{M} = \{M_j \in \mathbb{P}\}$ corresponding to a set of text samples $C$. The task
is to determine whether there exists a subset $S \subset \mathcal{M}$, such that the average
of the elements $M_j \in S$ is exactly the same as $T$:

$$T_w = \frac{1}{\#(S)} \sum_{M_j \in S} M_{j,w} \quad \text{for all } w \in W \tag{21}$$


We now present a reduction from 3SAT to EXACT-AOMM. We are given $F$, an
instance of 3SAT, and want to construct a set of distributions $\mathcal{M}$ together with a
target $T$, such that $F$ is satisfiable if and only if there exists a subset
$S \subset \mathcal{M}$ which satisfies equation (21). The construction of EXACT-AOMM is
similar to EXACT-POMM, but the proof is complicated by the fact that we are now dealing
with distributions, rather than count vectors. We cannot directly use the matrix
constructed in Section A.3 (Figure 5), because the rows may have varying numbers of
non-zero entries, and if we normalize them to lie in $\mathbb{P}$, the proof might fail. For
example, suppose we normalize the rows of the matrix in Figure 5. Then a 1 in a row from
the upper half of that matrix becomes the reciprocal of the number of non-zero entries in
that row (a number which will be greater than 1 for any variable $x_i$ that occurs in any
of the clauses in $F$). At the same time, a 1 in the lower half of the matrix remains a 1,
since every row in the lower half contains exactly one non-zero entry. To alleviate this
problem, we augment the matrix from Figure 5 in such a way that every row sums to the
same value. This is non-trivial, since we have to augment the matrix in such a way that
satisfiability of 3SAT is not violated.

Fig. 6. Matrix used in the proof that 3SAT is reducible to EXACT-AOMM


Let $\mathcal{M}'$ be the set of rows of the matrix in Figure 6, and let $T'$ be the target
(bottom) row. Here sub-matrices $I$, $0$, $POS$ and $NEG$ have the same meaning as in
Section A.3. In addition, $\neg POS$ is a logical negation of $POS$, i.e.
$\neg POS_{i,j} = 1$ whenever $POS_{i,j} = 0$, and vice-versa. Similarly, $\neg NEG$ is a
negation of $NEG$. The last column of the matrix contains zeros in the top $2n$ rows and
$m$ in the bottom $5m$ rows. Note that the new matrix contains the tableau constructed in
Section A.3 as a sub-matrix in the upper-left corner. The columns to the right of the
original tableau ensure that the sum of every row in the matrix is exactly $m + 1$. The
first $n + m$ positions of the target row are also identical to the target row for the
POMM problem.
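
Since the figure itself is not reproduced here, the sketch below builds one matrix that
matches the description: the function name, the row ordering and the helper-row layout
(two helper rows per clause column and three per negated column, each carrying $m$ in the
last column) are assumptions inferred from the stated size $(2n + 5m)$ by $(n + 2m + 1)$
and from the requirement that every row sums to $m + 1$. It then checks those two
properties on the example formula from Section A.2.

```python
def build_aomm_matrix(formula, n):
    """Augmented rows [I | POS | not POS | 0], [I | NEG | not NEG | 0], helper rows."""
    m = len(formula)
    rows = []
    for sign in (+1, -1):
        for i in range(1, n + 1):
            identity = [int(c == i) for c in range(1, n + 1)]
            occurs = [int(sign * i in clause) for clause in formula]
            negated = [1 - bit for bit in occurs]
            rows.append(identity + occurs + negated + [0])
    # Assumed helper layout: 2 copies for clause columns, 3 for negated columns.
    for block, copies in ((0, 2), (1, 3)):
        for _ in range(copies):
            for j in range(m):
                middle = [0] * (2 * m)
                middle[block * m + j] = 1
                rows.append([0] * n + middle + [m])
    target = [1] * n + [3] * m + [n] * m + [3 * m * m]
    return rows, target

formula = [(-1, 2, 3), (-1, -2, -4)]
n, m = 4, len(formula)
rows, target = build_aomm_matrix(formula, n)
row_sums = {sum(row) for row in rows}
print(len(rows), len(rows[0]), row_sums, sum(target))
```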

We define a corresponding instance of EXACT-AOMM by setting
$T = T'/(n + 3m + nm + 3m^2)$, and $\mathcal{M} = \{M'_j/(m + 1) : M'_j \in \mathcal{M}'\}$,
dividing each row by the sum of its elements to ensure that the result is in $\mathbb{P}$.
Now we want to prove that $F$ is satisfiable if and only if there exists a subset
$S \subset \mathcal{M}$ which satisfies equation (21).

Let $S \subset \mathcal{M}$ be the subset that satisfies equation (21). We assert that
$\#(S)$, the size of $S$, must be $(n + 3m)$. The proof is as follows:

(i) The sum of the first column over $S$ can be either $\frac{1}{m+1}$ or $\frac{2}{m+1}$,
and the target is $T_1 = \frac{1}{(n+3m)(m+1)}$. Therefore $\#(S)$ can be either
$(n + 3m)$ or $2(n + 3m)$.

(ii) The sum of the last column over $S$ can be any one of
$\frac{m}{m+1} \times \{1, 2, 3, \ldots, m, \ldots, 5m\}$, while the target is
$\frac{3m^2}{(n+3m)(m+1)}$. Therefore $\#(S)$ can be any one of
$\frac{n+3m}{3m} \times \{1, 2, 3, \ldots, m, \ldots, 5m\}$.

Since $S$ must satisfy equation (21) for all $w$, we must have $\#(S)$ satisfy both (i) and
(ii), and therefore it must be that $\#(S) = (n + 3m)$. Now, since by definition
$M_{j,w} = \frac{M'_{j,w}}{m+1}$ and $T_w = \frac{T'_w}{(n+3m)(m+1)}$, we can claim that:

$$T_w = \frac{1}{\#(S)} \sum_{M_j \in S} M_{j,w} \iff T'_w = \sum_{M'_j \in S'} M'_{j,w} \tag{22}$$


Here $S'$ is a subset of the rows that corresponds to $S$. It remains to show that $F$
(an instance of 3SAT) is satisfiable if and only if there exists a subset
$S' \subset \mathcal{M}'$ which satisfies the right-hand side of equation (22).


($\Leftarrow$) If such $S'$ exists, columns $\{1 \ldots n\}$ of $\mathcal{M}'$ guarantee that
we have a proper truth assignment, and columns $\{n+1 \ldots n+m\}$ guarantee that every
clause of $F$ has at least one true proposition, thus $F$ is satisfied.


($\Rightarrow$) If $F$ is satisfiable, it is clear that the right-hand side of equation (22)
holds for columns $\{1 \ldots n+m\}$ (e.g. by the SUBSET-SUM argument). Each one of the
columns $\{n+1 \ldots n+m\}$ contains 1, 2, or 3 non-zero entries (in the first $2n$ rows),
and therefore requires 2, 1 or 0 “helper” variables to add up to 3. Because columns
$\{n+m+1 \ldots n+m+m\}$ represent a logical negation of columns $\{n+1 \ldots n+m\}$,
they will contain $(n-1)$, $(n-2)$ or $(n-3)$ non-zero entries respectively and will
require 1, 2, or 3 “helper” variables to sum to $n$. Finally, the last column will reflect
the total number of “helpers” used for each of the $m$ clauses, which will always be
3 for every clause: either $(2+1)$, or $(1+2)$, or $(0+3)$. Since there are $m$ clauses
and every helper row contributes $m$ to the last column, the last column will sum up to
$3m \times m$.

This proves that $F$ is satisfiable if and only if the right-hand side of equation (22)
is satisfiable, which in turn holds if and only if the left-hand side (EXACT-AOMM) is
satisfiable. The matrix in Figure 6 is of size $(2n + 5m)$ by $(n + 2m + 1)$, thus we
have a polynomial-time reduction from an instance of 3SAT to an instance of
EXACT-AOMM. Thus EXACT-AOMM is NP-hard, and by the result in section 2.2, AOMM is
NP-hard as well.
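
The forward direction of the argument can be exercised end to end on the example formula
from Section A.2. The sketch below is an illustration under assumptions rather than a
reproduction of Figure 6: the matrix layout, the row ordering, the signed-integer
encoding of literals and the function names are inferred from the description in this
section, and each clause is assumed to mention three distinct variables. It selects one
row per variable plus three helper rows per clause, then checks with exact fractions that
the selection has size $n + 3m$ and that the average of the normalized rows equals the
normalized target, as equation (21) requires.

```python
from fractions import Fraction

def build_aomm_matrix(formula, n):
    """Assumed layout: variable rows, then 2m and 3m helper rows with m in the last column."""
    m = len(formula)
    rows = []
    for sign in (+1, -1):
        for i in range(1, n + 1):
            identity = [int(c == i) for c in range(1, n + 1)]
            occurs = [int(sign * i in clause) for clause in formula]
            rows.append(identity + occurs + [1 - b for b in occurs] + [0])
    for block, copies in ((0, 2), (1, 3)):
        for _ in range(copies):
            for j in range(m):
                middle = [0] * (2 * m)
                middle[block * m + j] = 1
                rows.append([0] * n + middle + [m])
    target = [1] * n + [3] * m + [n] * m + [3 * m * m]
    return rows, target

def select_rows(formula, n, assignment):
    """One row per variable; (3 - c) and c helper rows for a clause with c true literals."""
    m = len(formula)
    chosen = [(i - 1) if assignment[i] else (n + i - 1) for i in range(1, n + 1)]
    for j, clause in enumerate(formula):
        c = sum(assignment[abs(lit)] == (lit > 0) for lit in clause)
        chosen += [2 * n + k * m + j for k in range(3 - c)]
        chosen += [2 * n + 2 * m + k * m + j for k in range(c)]
    return chosen

formula = [(-1, 2, 3), (-1, -2, -4)]
n, m = 4, len(formula)
assignment = {1: False, 2: True, 3: True, 4: True}
rows, target = build_aomm_matrix(formula, n)
chosen = select_rows(formula, n, assignment)

# Normalize the target and the rows so that both lie in the probability simplex.
T = [Fraction(t, (n + 3 * m) * (m + 1)) for t in target]
M = [[Fraction(v, m + 1) for v in row] for row in rows]
average = [sum(M[r][w] for r in chosen) / len(chosen) for w in range(len(T))]
print(len(chosen) == n + 3 * m, average == T)
```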


